60 research outputs found

    Mining frequent biological sequences based on bitmap without candidate sequence generation

    Biological sequences carry a great deal of important genetic information about organisms. Furthermore, there is an inheritance law related to protein function and structure that is useful for applications such as disease prediction. Frequent sequence mining is a core technique for association rule discovery, but existing algorithms suffer from low efficiency or high error rates because biological sequences have more distinctive characteristics than general sequences. In this paper, an algorithm for mining Frequent Biological Sequences based on Bitmap, FBSB, is proposed. FBSB uses bitmaps as a simple data structure and transforms each row into a quicksort list (QS-list) for sequence growth. To meet the continuity and accuracy requirements of biological sequence mining, the sequences tested during the mining process of FBSB are real ones instead of generated candidates, and all the frequent sequences can be mined without any errors. Compared with other algorithms, the experimental results show that FBSB achieves better performance in both run time and scalability.
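As a rough illustration of the bitmap idea only, the sketch below counts the support of contiguous subsequences using one integer bitmap per symbol and per sequence. It is not the FBSB algorithm: the QS-list and the candidate-free growth described in the abstract are not reproduced, and the function names, the single-character alphabet, and the sample data are assumptions made for this example.

```python
from collections import defaultdict


def build_bitmaps(sequences):
    """Map each symbol to one bitmap (a Python int) per sequence.

    Bit i of bitmaps[symbol][sid] is set when `symbol` occurs at
    position i of sequence `sid`.
    """
    bitmaps = defaultdict(lambda: [0] * len(sequences))
    for sid, seq in enumerate(sequences):
        for pos, symbol in enumerate(seq):
            bitmaps[symbol][sid] |= 1 << pos
    return dict(bitmaps)


def mine_contiguous(sequences, min_support):
    """Return {pattern: support} for contiguous patterns (hypothetical helper).

    Support is the number of sequences that contain the pattern at least once.
    Symbols are assumed to be single characters.
    """
    bitmaps = build_bitmaps(sequences)
    frequent = {}

    def support(ends):
        return sum(1 for bits in ends if bits)

    def grow(pattern, ends):
        # ends[sid] marks the positions where an occurrence of `pattern` ends.
        for symbol, symbol_bits in bitmaps.items():
            # Shift the end positions by one and intersect with the symbol
            # bitmap: the surviving bits are the ends of `pattern + symbol`.
            extended = [(e << 1) & s for e, s in zip(ends, symbol_bits)]
            sup = support(extended)
            if sup >= min_support:
                frequent[pattern + symbol] = sup
                grow(pattern + symbol, extended)

    for symbol, symbol_bits in bitmaps.items():
        sup = support(symbol_bits)
        if sup >= min_support:
            frequent[symbol] = sup
            grow(symbol, symbol_bits)
    return frequent


if __name__ == "__main__":
    # Made-up toy sequences, not data from the paper.
    print(mine_contiguous(["ACGT", "ACGA", "TACG"], min_support=3))
```

The shift-and-intersect step keeps the mined patterns contiguous, which matches the continuity requirement the abstract mentions; how FBSB itself orders and grows sequences is not specified here.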

    Representative Points and Cluster Attributes Based Incremental Sequence Clustering Algorithm

    In order to improve the execution time and clustering quality of sequence clustering algorithms on large-scale dynamic datasets, a novel algorithm, RPCAISC (Representative Points and Cluster Attributes Based Incremental Sequence Clustering), is presented. In this paper, a density factor is defined. A primary representative point whose density factor is less than the prescribed threshold is deleted directly. New representative points can be reselected from non-representative points. Moreover, the representative points of each cluster are modeled using the K-nearest neighbor method. A definition of the relevant degree (RD) between clusters is also proposed. The RD is computed by comprehensively considering the correlations of objects within a cluster and between different clusters. It is then used to determine whether two clusters need to be merged. Additionally, the cluster attributes of the initial clustering are retained in this process. By calculating the matching degree between the incremental sequence and the existing cluster attributes, dynamic sequence clustering can be achieved. Theoretical analysis and experimental results show that RPCAISC achieves better clustering accuracy and execution efficiency.

    An algorithm for fast mining top-rank-k frequent patterns based on node-list data structure

    Frequent pattern mining usually requires much run time and memory. In some applications, only the patterns with the top frequency ranks are needed. Because the number of patterns is limited, the quality of the results is even more important than time and memory consumption. A Frequent Pattern algorithm for mining Top-rank-K patterns, FP_TopK, is proposed. It is based on a Node-list data structure extracted from the FTPP-tree. Each node carries one or more triple sets, which contain supports and the pre-order and post-order traversal orders for candidate pattern generation and top-rank-k frequent pattern mining. FP_TopK uses the minimal support threshold as a pruning strategy to guarantee that each pattern in the top-rank-k table is really frequent, which further improves efficiency. Experiments are conducted to compare FP_TopK with iNTK and BTK on four datasets. The results show that FP_TopK achieves better performance.
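To make the notion of a top-rank-k table concrete, here is a brute-force sketch that assumes the usual reading of the term: patterns with equal support share a rank, and the table keeps every pattern whose support is among the k highest distinct support values. It enumerates itemsets directly and does not use the Node-list or FTPP-tree structures; `top_rank_k`, `max_len`, and the toy transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_rank_k(transactions, k, max_len=3):
    """Return [(support, [patterns...]), ...] for the k highest support values.

    Brute force: every itemset of up to `max_len` items is counted, so this is
    only suitable for tiny inputs.
    """
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, min(max_len, len(unique)) + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1

    # Distinct support values, highest first; ties share a rank.
    top_supports = sorted(set(counts.values()), reverse=True)[:k]
    table = {s: [] for s in top_supports}
    for pattern, s in counts.items():
        if s in table:
            table[s].append(pattern)
    return [(s, sorted(table[s])) for s in top_supports]


if __name__ == "__main__":
    # Made-up transactions for illustration.
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]]
    for support, patterns in top_rank_k(data, k=2):
        print(support, patterns)
```

The lowest support kept in the table can act as a rising threshold during mining, which is the role the abstract assigns to the minimal support threshold used for pruning.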

    An Empirical study on Predicting Blood Pressure using Classification and Regression Trees

    Blood pressure diseases have become one of the major threats to human health. Continuous measurement of blood pressure has proven to be a prerequisite for effective incident prevention. In contrast with traditional prediction models, which have low measurement accuracy or long training times, non-invasive blood pressure measurement is a promising approach to continuous measurement. Thus, in this paper, classification and regression trees (CART) are proposed and applied to tackle the problem. Firstly, according to the characteristics of different information, different CART models are constructed. Secondly, in order to avoid the over-fitting problem of these models, the cross-validation method is used for selecting the optimum parameters so as to achieve the best generalization of these models. Based on the biological data collected from the CM400 monitor, this approach has achieved better performance than common existing models such as linear regression, ridge regression, the support vector machine, and the neural network in terms of accuracy rate, root mean square error, deviation rate, and Theil IC, and the required training time is also comparatively short. With increasing data, the accuracy rate of predicting systolic blood pressure and diastolic blood pressure by CART exceeds 90%, and the training time is less than 0.5 s.
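The core step of a CART regression tree is choosing the split that minimizes the squared error of the two resulting groups. The sketch below shows that step for a single numeric feature. It is a generic illustration rather than the models built in the paper; the function names and the feature and target values are invented, and the cross-validated parameter selection is not shown.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the best split of the form x <= threshold.

    The threshold is None when no split improves on the unsplit node.
    """
    pairs = sorted(zip(xs, ys))
    best = (None, sse(list(ys)))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


if __name__ == "__main__":
    # Hypothetical feature values and blood pressure readings.
    feature = [1, 2, 3, 10, 11, 12]
    systolic = [100, 102, 101, 140, 141, 139]
    print(best_split(feature, systolic))
```

A full tree applies this search recursively to each side of the split; limits such as maximum depth or minimum leaf size are the kind of parameters that cross-validation would tune.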

    A Phase Change Memory Chip Based on TiSbTe Alloy in 40-nm Standard CMOS Technology

    In this letter, a phase change random access memory (PCRAM) chip based on Ti0.4Sb2Te3 alloy material was fabricated in a 40-nm 4-metal level complementary metal-oxide semiconductor (CMOS) technology. The phase change resistor was integrated after CMOS logic fabrication. The PCRAM was successfully embedded without changing any logic device or process, and a 1.1 V negative-channel metal-oxide semiconductor device was used as the memory cell selector. The SET and RESET currents were found to be 0.2 and 0.5 mA, and the SET and RESET times 100 and 10 ns, respectively. The high-speed performance of this chip may highlight its design advantages in many embedded applications.

    CDA: A clustering degree based influential spreader identification algorithm in weighted complex network

    Identifying the most influential spreaders in a weighted complex network is vital for optimizing the utilization of the network structure and promoting information propagation. Most existing algorithms focus on node centrality, which considers connectivity more than clustering. In this paper, a novel clustering degree algorithm (CDA) is proposed to identify the most influential spreaders in a weighted network. First, the weighted degree of a node is defined according to the node degree and strength. Then, based on the node weighted degree, the clustering degree of a node is calculated with respect to the network topological structure. Finally, the propagation capability of a node is obtained by accounting for the clustering degree of the node and the contribution from its neighbors. In order to evaluate the performance of the proposed CDA algorithm, the susceptible-infected-recovered model is adopted to simulate the propagation process in real-world networks. The experimental results show that CDA is the most effective algorithm in terms of Kendall's tau coefficient and has the highest accuracy in influential spreader identification compared with other algorithms such as weighted degree centrality, weighted closeness centrality, evidential centrality, and evidential semilocal centrality.
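Kendall's tau measures how well two rankings agree, for example a centrality score per node against the spreading size observed for that node in a simulation. The sketch below implements the simple tau-a variant, which ignores tie corrections. Which variant the paper uses is not stated, so this choice, the function name, and the sample scores are assumptions.

```python
from itertools import combinations


def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Pairs tied in either list count as neither concordant nor discordant.
    Requires two lists of equal length with at least two elements.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical per-node scores: an algorithm's ranking score and a
    # simulated spreading size for the same four nodes.
    algorithm_score = [0.9, 0.7, 0.4, 0.1]
    spreading_size = [120, 95, 101, 30]
    print(kendall_tau_a(algorithm_score, spreading_size))
```

A value close to 1 means the two orderings largely agree, which is how a ranking method would be judged against simulated spreading results.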

    DMP_MI: an effective diabetes mellitus classification algorithm on imbalanced data with missing values

    © 2019 Institute of Electrical and Electronics Engineers Inc. All rights reserved. As a widely known chronic disease, diabetes mellitus is called a silent killer. It makes the body produce less insulin and causes increased blood sugar, which leads to many complications and affects the normal functioning of various organs, such as the eyes, kidneys, and nerves. Although diabetes has attracted considerable research attention, the overall performance of diabetes classification using machine learning is relatively low because of missing values and class imbalance in the data. In this paper, we propose an effective Prediction algorithm for Diabetes Mellitus classification on Imbalanced data with Missing values (DMP_MI). First, the missing values are compensated by the Naïve Bayes (NB) method for data normalization. Then, an adaptive synthetic sampling method (ADASYN) is adopted to reduce the influence of class imbalance on the prediction performance. Finally, a random forest (RF) classifier is used to generate predictions and is evaluated using a comprehensive set of evaluation indicators. Experiments performed on the Pima Indians diabetes dataset from the University of California, Irvine (UCI) Repository have demonstrated the effectiveness and superiority of our proposed DMP_MI.

    Security feature measurement for frequent dynamic execution paths in software system

    © 2018 Qian Wang et al. The scale and complexity of software systems are constantly increasing, imposing new challenges for software fault location and daily maintenance. In this paper, the Security Feature measurement algorithm of Frequent dynamic execution Paths in Software, SFFPS, is proposed to provide a basis for improving the security and reliability of software. First, the dynamic execution of a complex software system is mapped onto a complex network model and a sequence model. With these models, together with the invocation and dependency relationships between function nodes, the fault cumulative effect and spread effect can be analyzed. The function node security features of the software complex network are defined and measured according to the degree distribution and global step attenuation factor. Finally, frequent software execution paths are mined and weighted, and security metrics of the frequent paths are obtained and sorted. The experimental results show that SFFPS has good time performance and scalability, and the security features of the important paths in the software can be effectively measured. This study provides a guide for research on defect propagation, software reliability, and software integration testing.

    Ensemble multiboost based on ripper classifier for prediction of imbalanced software defect data

    Identifying defective software entities is essential to ensure software quality during software development. However, the high dimensionality and class distribution imbalance of software defect data seriously affect software defect prediction performance. In order to solve this problem, this paper proposes an Ensemble MultiBoost based on RIPPER classifier for prediction of imbalanced Software Defect data, called EMR_SD. Firstly, the algorithm uses the principal component analysis (PCA) method to find the most effective features among the original features of the data set, so as to reduce dimensionality and remove redundancy. Furthermore, a combined sampling method of adaptive synthetic sampling (ADASYN) and random sampling without replacement is applied to solve the problem of data class imbalance. The classifier establishes association rules based on attributes and classes, using MultiBoost to reduce deviation and variance and thereby reduce classification error. The proposed prediction model is evaluated experimentally on the NASA MDP public datasets and compared with existing similar algorithms. The results show that the EMR_SD algorithm is superior to DNC, CEL, and other defect prediction techniques on most evaluation indicators, which demonstrates the effectiveness of the algorithm.

    Health data driven on continuous blood pressure prediction based on gradient boosting decision tree algorithm

    Diseases related to blood pressure are becoming a major threat to human health. With the development of telemedicine monitoring applications, a growing number of corresponding devices are being marketed, for example for remote monitoring that increases the autonomy of the elderly and thus encourages a healthier and longer health span. Using machine learning algorithms to measure blood pressure continuously is a feasible way to provide models and analysis for telemedicine monitoring data and to predict blood pressure. In this paper, we applied the gradient boosting decision tree (GBDT) to predict blood pressure based on the human physiological data collected by the EIMO device. The signals acquired by the EIMO equipment include ECG and PPG. In order to avoid over-fitting, the optimal parameters are selected via the cross-validation method. Consequently, our method displayed a higher accuracy rate and a better mean absolute error than methods such as the traditional least squares method, ridge regression, lasso regression, ElasticNet, SVR, and the KNN algorithm. When predicting the blood pressure of a single individual, GBDT achieves an accuracy rate above 70% for systolic pressure and above 64% for diastolic pressure, with a prediction time of less than 0.1 s. In conclusion, applying the GBDT is the best method for predicting the blood pressure of multiple individuals: with the inclusion of data such as age, body fat, ratio, and height, algorithm accuracy improves, which in turn indicates that the inclusion of new features aids prediction performance.
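Gradient boosting with squared loss can be sketched in a few lines: start from the mean of the targets, then repeatedly fit a small tree to the current residuals and add a damped copy of its prediction. The version below uses one-split stumps on a single feature. It is a teaching sketch under those assumptions, not the model trained in the paper; all names, the learning rate, the number of rounds, and the data are invented.

```python
def fit_stump(xs, residuals):
    """Least-squares stump: returns (threshold, left_mean, right_mean).

    Assumes `xs` contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = sum((r - left_mean) ** 2 for r in left)
        err += sum((r - right_mean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbdt(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps on the residuals of the running prediction."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_gbdt(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


if __name__ == "__main__":
    # Hypothetical single feature and blood pressure readings.
    feature = [1, 2, 3, 4, 5, 6]
    pressure = [110, 112, 111, 130, 131, 129]
    model = fit_gbdt(feature, pressure)
    print([round(predict_gbdt(model, x), 1) for x in feature])
```

The number of rounds, the learning rate, and the tree size are the sort of settings that the cross-validation step mentioned in the abstract would select.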